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Abstract 

For big data analysis, high computational cost for Bayesian methods often limits their ap¬ 
plications in practice. In recent years, there have been many attempts to improve compu¬ 
tational efficiency of Bayesian inference. Here we propose an efficient and scalable com¬ 
putational technique for a state-of-the-art Markov Chain Monte Carlo (MCMC) methods, 
namely, Hamiltonian Monte Carlo (HMC). The key idea is to explore and exploit the struc¬ 
ture and regularity in parameter space for the underlying probabilistic model to construct 
an effective approximation of its geometric properties. To this end, we build a surrogate 
function to approximate the target distribution using properly chosen random bases and 
an efficient optimization process. The resulting method provides a flexible, scalable, and 
efficient sampling algorithm, which converges to the correct target distribution. We show 
that by choosing the basis functions and optimization process differently, our method can 
be related to other approaches for the construction of surrogate functions such as gener¬ 
alized additive models or Gaussian process models. Experiments based on simulated and 
real data show that our approach leads to substantially more efficient sampling algorithms 
compared to existing state-of-the art methods. 


1 Introduction 

Bayesian statistics has provided a principled and robust framework to create many important and power¬ 
ful data analysis methods over the past several decades. Given a probabilistic model for the underlying 
mechanism of observed data, Bayesian methods properly quantify uncertainty and reveal the landscape or 
global structure of the parameter space. However, these methods tend to be computationally intensive since 
Bayesian inference usually requires the use of Markov Chain Monte Carlo (MCMC) algorithms to simulate 
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samples from intractable distributions. Although the simple Metropolis algorithm H] is often effective at 
exploring low-dimensional distributions, it can be very inefficient for complex, high-dimensional distribu¬ 
tions: successive states may exhibit high autocorrelation, due to the random walk nature of the movement. 
As a result, the effective sample size tends to be quite low and the convergence to the true distribution is 
usually very slow. The celebrated Hamiltonian Monte Carlo (HMC) ElIU reduces the random walk behavior 
of Metropolis by Hamiltonian dynamics, which uses gradient information to propose states that are distant 
from the current state, but nevertheless have a high probability of acceptance. 

Although HMC explores the parameter space more efficiently than random walk Metropolis does, it does 
not fully exploit the structure (i.e., geometric properties) of parameter space 0 since dynamics are defined 
over Euclidean space. To address this issue, Girolami and Calderhead 0 proposed a new method, called 
Riemannian Manifold HMC (RMHMC), that uses the Riemannian geometry of the parameter space ||6l to 
improve standard HMC’s efficiency by automatically adapting to local structures. 

To make such geometrically motivated methods practical for big data analysis, one needs to combine them 
with efficient and scalable computational techniques. A common bottleneck for using such sampling algo¬ 
rithms for big data analysis is repetitive evaluations of functions, their derivatives, geometric and statistical 
quantities that involves the whole observed data and maybe a complicated model. A natural question is how 
to construct effective approximation of these quantities that provides a good balance between accuracy and 
computation cost. One common approach is subsampling (see, for example, lIlIHlIllIIol), which restricts 
the computation to a subset of the observed data. This is based on the idea that big datasets contain a large 
amount of redundancy so the overall information can be retrieved from a small subset. However, in general 
applications, we cannot simply use random subsets for this purpose: the amount of information we lose as a 
result of random sampling leads to non-ignorable loss of accuracy, which in turn has a substantially negative 
impact on computational efficiency m. Therefore, in practice, it is a challenge to hnd good criteria and 
strategies for an appropriate and effective subsampling. 

Another approach is to exploit smoothness or regularity in parameter space, which is true for most statistical 
models. This way, one could hnd computationally cheaper surrogate functions to substitute the expensive 
target (or potential energy) functions msi [Tbilisi EH [m El. However, the usefulness of these methods is 
often limited to moderate dimensional problems because of the computational cost needed to achieve desired 
approximation accuracy. 

In this work, our objective is to develop a faster alternative to the method of M- To this end, we propose 
to use random nonlinear bases along with efficient learning algorithms to construct a surrogate functions 
that provides effective approximation of the probabilistic model based on the collective behavior of the 
large data. The randomized nonlinear basis functions combined with the computationally efficient learning 
process can incorporate correct criteria for an efficient implicit subsampling resulting in both hexible and 
scalable approximation lE9l[^[3ni^ . Because our method can be presented as a special case of shallow 
random networks implemented in HMC, we refer to it as random network surrogate function', however, we 
will show that our proposed method is related to (and can be extended to) other surrogate functions such 
as generalized additive models and Gaussian process models by constructing the surrogate functions using 
different bases and optimization process. 

Our proposed method provides a natural framework to incorporate surrogate functions in the sampling algo¬ 
rithms such as HMC, and it can be easily extended to geometrically motivated methods such as Riemannian 
Manifold HMC. Further, for problems with a limited time budget, we propose an adaptive version of our 
method that substantially reduces the required number of training points. This way, the random bases surro¬ 
gate function could be utilized earlier and its approximation accuracy could be improved adaptively as more 
training points become available. We show that theoretically the learning procedure for our surrogate func¬ 
tion is asymptotically equivalent to potential matching, which is itself a novel distribution matching strategy 
similar to the score matching method discussed in IITSIEOI . 
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Finally, we should emphasize that out method is used to generate high quality proposals at low computa¬ 
tional cost. However, when calculating the acceptance probability of these proposals, we use the original 
Hamiltonian (used in standard HMC) to ensure that the stationary distribution of the Markov chain will 
remain the correct target distribution. 

Our paper is organized as follows. An overview of HMC and RMHMC is given in Section]^ Our random 
network surrogate HMC is explained in detail in Section]^ The adaptive model is developed in Section]^ 
Experimental results based on simulated and real data are presented in Section [5] C ode for these examples 
is available at github. com/chengzhang-uci/RNSHMC Finally, Sectionis devoted to discussion 
and future work. As an interesting supplement, an overall description of the potential matching procedure is 
presented in Sectionj^in the appendix. 


2 Preliminaries 

2.1 Hamiltonian Monte Carlo 


In Bayesian Statistics, we are interested in sampling from the posterior distribution of the model parameters 
q given the observed data, Y = (j/i, 2 / 2 , ■ • •, Un)'^, 

P{q\Y) (xexp{-U{q)), (1) 

where the potential energy function U is defined as 

N 

U{q) = -Y,^og P{yM)-log P{q). (2) 

i=l 


The posterior distribution is almost always analytically intractable. Therefore, MCMC algorithms are typ¬ 
ically used for sampling from the posterior distribution to perform statistical inference. As the number of 
parameters increases, however, simple methods such as random walk Metropolis |[T| and Gibbs sampling 
0 may require a long time to converge to the target distribution. Moreover, their explorations of parameter 
space become slow due to inefficient random walk proposal-generating mechanisms, especially when there 
exist strong dependencies between parameters in the target distribution. By inclusion of geometric infor¬ 
mation from the target distribution, HMC ID ID introduces a Hamiltonian dynamics system with auxiliary 
momentum variables p to propose samples of 9 in a Metropolis framework that explores the parameter space 
more efficiently compared to standard random walk proposals. More specifically, HMC generates proposals 
jointly for q and p using the following system of differential equations: 

— = — Gi 

dt dpi 

d^^_dH 
dt dqi 

where the Hamiltonian function is defined as H{q,p) = U{q) + ^p^M~^p. The quadratic kinetic energy 
function K(j>) = ^p'^M~^p corresponds to the negative log-density of a zero-mean multivariate Gaussian 
distribution with the covariance M. Here, M is known as the mass matrix, which is often set to the identity 
matrix, I. Starting from the current state {q,p), the Hamiltonian dynamics system ([^,Q is simulated for L 
steps using the leapfrog method, with a stepsize of e. The proposed state, {q* ,p*), which is at the end of the 
trajectory, is accepted with probability min(l, exp[—//(g*,;?*) + H{q,p)]). By simulating the Hamiltonian 
dynamics system together with the correction step, HMC generates samples from a joint distribution 


P{q,p) oc exp 
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Algorithm 1: Hamiltonian Monte Carlo 

Input: Starting position and step size e 
for t = 1, 2, • • ■ do 

Resample momentum p 
pW (qo,Po) = (gW,p(‘)) 

Simulate discretization of Hamiltonian dynamics: 
for I = 1 to L do 

pi.i ^pi-i - 

qi qi-i + eM“V-i 

I Pi Pi - l%{<li) 

(9*,p*) = {qL,PL) 

Metropolis-Hasting correction: 
u ~ Uniform[0,1] 
p = exp[_H'(q(*\pf*^) - H{q\p*)] 
if u < min(l,p), then = q*\ 


Notice that q and p are separated, the marginal distribution of q then follows the target distribution. These 
steps are presented in Algorithm Following the dynamics of the assumed Hamiltonian system, HMC 
can generate distant proposals with high acceptance probability which allows an efficient exploration of 
parameter space. 


2.2 Riemannian Manifold HMC 


Although HMC explores the target distribution more efficiently than random walk Metropolis, it does not 
fully exploit the geometric structures of the underlying probabilistic model since a flat metric (i.e., M = I) 
is used. Using more geometrically motivated methods could substantially improves sampling algorithms’ 
efficiency. Recently, Girolami and Calderhead Q proposed a new method, called Riemannian Manifold 
HMC (RMHMC), that exploits the Riemannian geometry of the target distribution to improve standard 
HMC’s efficiency by automatically adapting to local structures. To this end, instead of the identity mass 
matrix commonly used in standard HMC, they use a position-specific mass matrix M = G{q). More 
specifically, they set G{q) to the Fisher information matrix, and define Hamiltonian as follows: 

H{q,p) = U{q) + i log det G(q) -f ^p'^G{q)~^p = (t>{q) + ]^p^G{q)~^p (5) 

where 4){q) := U{q)-\- \ log det G{q). Note that standard HMC is a special case of RMHMC with G{q) = I. 
Based on this dynamic, they propose the following HMC on Riemmanian manifold: 


q = VpH[q,p) = G{q) V 

p = -VqH{q,p) = 
where the ith element of the vector v{q,p) is 


(6) 


{v{q,p))i =-p^di{G{q) i)p=(G(g) ^pf d^G{q)G{q) ^p 
with the shorthand notation di = d/dqi for partial derivative. 

The above dynamic is non-separable (it contains products of q and p), and the resulting proposal generating 
mechanism based on the standard leapfrog method is neither time-reversible nor symplectic. Therefore, the 
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standard leapfrog algorithm cannot be used for the above dynamic El. Instead, we can use the Stomer-Veriet 
ca method, known as generalized leapfrog im, 


(t+1/2) ^ (t) _ e 

2 


1 


q{t+l) ^ q{t) 


e 

2 L 


(*+ 1 / 2 ) 


. 2 




p(t+l) = p(t+l/2) 




(7) 

( 8 ) 
(9) 


The resulting map is 1) deterministic, 2) reversible, and 3) volume-preserving. However, it requires solving 
two computationally intensive implicit equations (Equations (0 and 0) at each leapfrog step. 


3 Random Network Surrogate HMC (RNS-HMC) 

For HMC, the Hamiltonian dynamics contains the information from the target distribution through the po¬ 
tential energy U and its gradient. For RMHMC, more geometric structure (i.e., the Fisher information) is 
included through the mass matrix for kinetic energy. It is the inclusion of such information in the Hamil¬ 
tonian dynamics that allows HMC and RMHMC to improve upon random walk Metropolis. However, one 
common computational bottleneck for HMC and other Bayesian models for big data is repetitive evaluations 
of functions, their derivatives, geometric and statistical quantities. Typically, each evaluation involves the 
whole observed data. For example, one has to compute the potential U and its gradient from the equation 0 
for HMC and mass matrix M, its inverse and the involved partial derivatives for RMHMC at every time step 
or M-H correction step. When N is large, this can be extremely expensive to compute. In some problems, 
each evaluation may involve solving a computationally expensive problem. (See the inverse problem and 
Remark in Section [50) 

To alleviate this issue, in recent years several methods have been proposed to construct surrogate Hamiltoni¬ 
ans. For relatively low dimensional spaces, (sparse) grid based piecewise interpolative approximation using 
precomputing strategy was developed in im. Such grid based methods, however, are difficult to extend 
to high dimensional spaces due to the use of structured grids. Alternatively, we can use Gaussian process 
model, which are commonly used as surTogate models for emulating expensive-to-evaluate functions, to 
learn the target functions from early evaluations (training data) However, naive (but commonly 

used) implementations of Gaussian process models have high computation cost associated with inverting the 
covariance matrix, which grows cubically as the size of the training set increases. This is especially crucial in 
high dimensional spaces, where we need large training sets in order to achieve a reasonable level of accuracy. 
Recently, scalable Gaussian processes using induing point methods li23l l24ll have been introduced to scale 
up GPs to larger datasets. While these methods have been quite succsessful in reducing computation cost, 
the tuning of inducing points could still be problematic in high dimensional spaces. (See a more detailed 
discussion in Section[30) 

The key in developing surTogate functions is to develop a method that can effectively capture the collective 
properties of large datasets with scalability, flexibility and efficiency. In this work, we propose to construct 
surrogate functions using proper random nonlinear bases and efficient optimization process on training data. 
More specifically, we present our method as a special case of shallow neural networks; although, we show 
that it is related to (and can be extended to) other surTogate functions such as generalized additive models 
and Gaussian process models. Random networks of nonlinear functions prove capable of approximating a 
rich class of functions arbitrarily well ll^ (3^ . Using random nonlinear networks and algebraic learning 
algorithms can also be viewed as an effective implicit subsampling with desired criteria. The choice of 
hidden units (basis functions) and the fast learning process can be easily adapted to be problem specific 
and scalable. Unlike typical (naive) Gaussian process models, our random network scales linearly with the 
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number of training points. In fact, a random nonlinear network can be considered as a standard regression 
model with randomly mapped features. For such shallow random networks, the computational cost for 
inference is cubic in the number of hidden nodes. Those differences in scaling allow us to explicitly trade 
off computational efficiency and approximation accuracy and construct more efficient surrogate in certain 
applications. As our empirical results suggest, with appropriate training data good approximation of smooth 
functions in high dimensional space can be achieved using a moderate and scalable number of hidden units. 
Therefore, our proposed method has the potential to scale up to large data sizes and provide effective and 
scalable surrogate Hamiltonians that balance accuracy and efficiency well. 

3.1 Shallow Random Network Approximation 

A typical shallow network architecture (i.e., a single-hidden layer feedforward scalar-output neural network) 
with s hidden units, a nonlinear activation function a, and a scalar (for simplicity) output z for a given d- 
dimensional input q is dehned as 

S 

z{q) ='^Via{q;'yi) + b ( 10 ) 

i=l 

where 7 ^ is the /th hidden node parameter, Vi is the output weight for the /th hidden node, and b is the output 
bias. Given a training data set 

r= f(j') eK, j = l,...,N} 

the neural network can be trained by hnding the optimal model parameters W = {71, Vi, i = 1,..., s,b} 
to minimize the mean square error cost function, 

1 ^ 

= ( 11 ) 

The most popular algorithm in machine learning to optimize GU is back-propagation ca. However, as 
a gradient descent-based iterative algorithm, back-propagation is usually quite slow and can be trapped at 
some local minimum since the cost function is nonlinear, and for most cases, non-convex. Motivated by the 
fact that randomization is computationally cheaper than optimization, alternative methods based on random 
nonlinear bases have been proposed Ian EH. These methods drastically decrease the computation cost while 
maintaining a reasonable level of approximation accuracy. The key feature of random networks is that they 
reduce the full optimization problem into standard linear regression by mapping the input data to a random¬ 
ized feature space and then apply existing fast algebraic training methods (e.g., by minimizing squared error) 
to hnd the output weight. Given the design objective, algebraic training can achieve exact or approximate 
matching of the data at the training points. Compared to the gradient descent-based techniques, algebraic 
training methods can reduce computational complexity and provide better generalization properties. A typ¬ 
ical algebraic approach for single-hidden layer feedforward random networks is extreme learning machine 
(ELM) ||2^ . which is summarized in Algorithm]^ Using randomized nonlinear features, ELM estimates 
the output weight by hnding the least-squares solution to the resulting linear equations system Hv = T. 
Note that presented this way, our method can also be regarded as a random version of Generalized Additive 
Model (GAM). In practice, people could add regularization to improve stability and generalizability. 

3.2 Choice of Nonlinearity 

There are many choices for nonlinear activation functions in random networks. Different types of activation 
functions can be used for different learning tasks. In this paper, we focus on random networks with two 
typical types of nonlinear nodes; 
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Algorithm 2: Extreme Learning Machine 

Given a training set T = tj £ R"*, j = 1,..., N}, activation function a{x’, 7 ) and hidden node 

number s 

Step 1: Randomly assign hidden node parameters ')i,i = 1,..., s 
Step 2: Calculate the hidden layer output matrix H 

Hji = a(7j;7i), i = 1,... ,s, j = 1,... ,iV 
Step 3: Calculate the output weight v 

V = H^T, r = [fi, f2, • • •,fiv]^ 

where 77^ is the Moore-Penrose generalized inverse of matrix 77 


• Additive nodes: 

0(9; 7) = • 9 + d G K, 7 = {w, d} 

where w and d are the weight vector and the bias of the hidden node. 

• Radial basis functions (RBF) nodes: 

= a ’ cGK'^, 7 = {c,7} 

where c and i are the center and width of the hidden node. 

Both random networks can approximate a rich class of functions arbitrarily well ll^l^ . With randomly 
assigned input weights and biases composed linearly inside the nonlinear activation function, additive nodes 
form a set of basis functions, whose level sets are hyperplanes orientated by Wi and shifted by di respectively. 
Random networks with additive nodes tend to reflect the global structure of the target function. On the other 
hand, RBF nodes are almost compactly supported (can be adjusted by the width 7) rendering good local 
approximation for the corresponding random networks. 

3.3 Connection to GPs and Sparse GPs 

It is worth noting the connection between networks with RBF nodes and Gaussian processes models ||2T]|22l- 
Given a training data set 

r = G G K, j = 1,..., iV} 

and using squared exponential covariance function 

K{q<^o)^qU')) = ^2 exp , Q = {ay,7} 

the standard GP regression with a Gaussian noise model has the following marginal likelihood 

p{t\Q,e)=N{t\G,KN + cj‘^I) 

where Q = [KN]jj' = K, q^^ '> ) is the covariance matrix and a is the noise 

parameter. Prediction on a new observation q* is made according to the conditional distribution 

Pit*\q*,r, 0) = {Km + K,, - kj{K m + a^iy^k, + a^) 
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where = K{q^^\q*) and if** = K{q*,q*). 

On the other hand, if we use K{q^^'>, •) as the jth hidden node, j = 1,2,. .., N, the output matrix becomes 
H = Kfq, and the output weight learned by algebraic approach to a regularized least square problem is 

V = argmin \\Hv — t\\^ + a'^v^K = (if at + 

V 

Therefore, such a network provides the same prediction point estimate as the above full GP models. This 
way, a Gaussian process model can be interpreted as a self-organizing RBF network where new hidden 
nodes are added adaptively with new observations. This is also an alternative point of view to HI where 
GP models were shown to be equivalent to single hidden-layer neural networks with infinite many hidden 
nodes. 

Notice that the above GP model scales cubically with the number of data points N which limits the appli¬ 
cation of GPs to relatively small datasets. To derive a GP model that is computationally tractable for large 
datasets, sparse GPs based on inducing point methods ll^l24ll have been previously proposed. These sparse 
models introduce a set of inducing points Q = and approximate the exact kernel K{q,q') by 

an approximation K{q,q') for fast computation. For example, the fully independent training conditional 
(FITC) II23I method uses the approximate kernel 

KFiTciq,q') = 

where Km is. the exact covariance matrix for the M inducing points and kq, kqi are the exact covariance 
matrices between q, q' and the inducing points. Given the same training data set T, the marginal likelihood 
is 

p{t\Q, 9) = Af{t\0, KnmK]^Kmn + A + cr^ J) 

Here, A is a diagonal matrix with Ajj = Kjj — kjK'^jkj that adjusts the diagonal elements to the ones 
from the exact covariance matrix Kjsf. Simiarly, predictions can be made according to the conditional 
distribution 

p{t*\q*,T, 9) = N{e\klS-^K mn{^ + A** - fef + a^) 

where Sm = Km + Kmn{-^ + K^m- The computation cost is reduced to A) for learning 

and after that the predictive mean costs 0{M) per new observation. The hyperparameters 9, a and inducing 
points Q can be tuned to maximize the marginal likelihood. However, in high dimension the tuning of Q 
becomes infeasible. 

On the other hand, if we use the inducing points Q as the centers of a RBF network, the output matrix 
H = Kf^M- Given the diagonal matrix D = A + a'^I, the output weight estimated by the algebraic 
approach to a weighted least square problem plus a regularization term is 

V = argmin \\D~i{Hv - t)|p -f v'^Kmv = 

V 

Therefore, the same predictive mean can be reproduced if we use the inducing points as centers and use the 
same hyperparameter configuration in our random network with RBF nodes and optimize with respect to the 
above cost function. Moreover, those hyperparameters in each basis (or hidden nodes) can also be trained 
(typically by gradient decent) if we abandon the use of random bases and simple linear regression. 

Figure [T] compares random networks with different node types and related GP methods on fitting a simple 
function y = x^/2 which corresponds to the negative log-density of the standard normal distributions. We 
used softplus function (t(x) = log (l + exp(x)) in additive nodes and exponential square kernels in RBF 
nodes and GP methods. As we can see from the graph, random networks generally perform better than 
GP methods when the number of observations is small. The randomness in the configuration of hidden 
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N = 10 N = 20 N = 40 





Figure 1: Comparing different surrogate approximations with an increasing number of observations N = 
10,20,40 on target function y = 12. The observation points are nested samples from the standard normal 

distribution. For FITC and random networks, we choose 5 inducing points and 5 hidden neurons respectively. 
FITC and random networks are all run 100 times and averaged to reduce the random effects on the results. 

nodes force networks to learn more globally. In contrast, GP models are more local and need more data to 
generalize well. By introducing sparsity, FITC tends to generalize better than full GP, especially on small 
datasizes. Since our goal is to fit negative log-posterior density function in (j^ and U{q) —> oo as q moves 
away from the high density domain, using softplus basis functions in random networks are more capable to 
capture this far field feature by striking a better balance between flexibility and generalization while being 
less demanding on the datasize. Also, the number of hidden neurons (bases) can be used to regularize the 
approximation to mitigate overfltting issue. 

3.4 Surrogate Induced Hamiltonian Flow 

As mentioned in the previous sections, repetitive computation of Hamiltonian, its gradient and other quan¬ 
tities that involve all data set undermine the overall exploration efficiency of HMC. To alleviate this issue, 
we exploit the smoothness or regularity in parameter space, which is true for most statistical models. As 
discussed in dlBlIIll, one can improve computational efficiency of HMC by approximating the energy 
function and using the resulting approximation to device a surrogate transition mechanism while still con¬ 
verging to the correct target distribution. To this end, m proposed to use pre-convergence samples (which 
are discarded during the burn-in period) to approximate the energy function using a Gaussian process model. 
Here, we define an alternative surrogate-induced Hamiltonian as follows; 

H{q,p) = U{q) + 

where U{q) is the neural network surrogate function. H(q,p) now defines a surrogate-induced Hamiltonian 
flow, parametrized by a trajectory length t, which is a map (pt : {q,p) {q*,p*). Here, {q*,p*) being the 

end-point of the trajectory governed by the following equations 

rfp dii dU 
dt dp dt dq dq 

When the original potential U{q) is computationally costly, simulating the surrogate induced Hamiltonian 
system provides a more efficient proposing mechanism for our HMC sampler. The introduced bias along 
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Figure 2: Comparing HMC and NNS-HMC based on a 2-dimensional banana-shaped distribution. The 
left panel shows the gradient fields (force map) for the original Hamiltonian flow (red) and the surrogate 
induced Hamiltonian flow (blue). The middle and right panel show the trajectories for HMC and NNS- 
HMC samplers. Both samplers start from the same point (red square) with same initial momentums. Blue 
points at the end of the trajectories are the proposals. The overall acceptance probability drops from 0.97 
using HMC to 0.88 using NNS-HMC. 


with discretization error from the leap-frog integrator are all naturally corrected in the MH step where we use 
the original Hamiltonian in the computation of acceptance probability. As a result, the stationary distribution 
of the Markov chain will remain the correct target distribution (see Appendix [A| for a detailed proof). Note 
that by controlling the approximation quality of the surrogate function, we can maintain a relatively high 
acceptance probability. This is illustrated in Figure |^for a two-dimensional banana-shaped distribution Q. 

To construct such a surrogate, the early evaluations of the target function during the early iterations of 
MCMC will be used as the training set based on which we can train a shallow random network using fast 
algebraic approaches, such as ELM (Algorithm]^. The gradient of the scalar output z (see[T0|) for a network 
with additive hidden nodes, for example, then can be computed as 


dz 

dq 


S 

Viufwi ■ q + di)wi 

i=l 


( 12 ) 


which costs only 0{s) computations. To balance the efficiency in computation and flexibility in approxi¬ 
mation, and to reduce the possibility of overfitting, the number of hidden nodes s need to be small as long 
as a reasonable level of accuracy can be achieved. In practice, this can be done by monitoring the resulting 
acceptance rate using an initial chain. 

Following we propose to run our method, henceforth called random network surrogate Hamiltonian 
Monte Carlo (RNS-HMC, see Algorithm]^, in two phases; exploration phase and exploitation phase. During 
the exploration phase, we initialize the training data set D with an empty set or some samples from the 
prior distribution of parameters. We then run the standard HMC algorithm for some iterations and collect 
information from the new states (i.e., accepted proposals). When we have sufficiently explored the high 
density domain in parameter space and collected enough training data (during the bum-in period), a shallow 
random network is trained based on the collected training set D to form a surrogate for the potential energy 
function. The surrogate function will be used to approximate the gradient information needed for HMC 
simulations later in the exploitation phase. 
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Figure 3: Comparing the efficiency of our random network surrogates and Gaussian process surrogates on 
a challenging 32 dimensional Gaussian target whose covariance matrix has an eigenvector (1,1,...,!)^ 
with a corresponding eigenvalue of 1.0, and all other eigenvalues are 0.01. We set the step size to keep the 
acceptance probability around 70% for HMC and use the same step size in all surrogate methods. For FITC 
and random networks, the number of inducing points and hidden neurons are all set to be 1000 to allow 
reasonably accurate approximation. We ran each algorithm ten times and plot the medians and 80% error 
bars. 


As an illustrative example, we compare the performance of different surrogate HMC methods on a chal¬ 
lenging Gaussian target density in 32 dimensions (A lower dimensional case was used in ifTbl ). The target 
density has 31 confined directions and a main direction that is 10 times wider, and all variables are corre¬ 
lated. Both the full GPs and FITC methods are implemented using GPML package ini. The results are 
presented in Figure]^ Compared to the full GPs, FITC and random networks (with additive and RBF nodes) 
all scales linearly with the number of observations. Both random network surrogates can start with fewer 
training data. We also compare the efficiency of the surrogate induced Hamiltonian flows in terms of time 
normalized mean effective sample sizes (ESS). The efficiency of FITC and random networks all increases as 
the number of observations increase until no more approximation gain can be obtained (see the acceptance 
probability in the middle panel). However, the efficiency of full GP begin to drop before the model reaches 
its full capacity. That is because its predictive complexity also grows with the number of observations, which 
in turn diminishes the overall efficiency. Overall, the random network with additive nodes outperform other 
methods based on these example. 

Our proposed method provides a natural framework to incorporate surrogate functions in HMC. Moreover, 
it can be easily extended to RMHMC. To this end, the Hessian matrix of the surrogate function can be used 
to construct a metric in parameter space and the third order derivatives can be used to simulate the corre¬ 
sponding modified Hamiltonian flow. We refer to this extended version of our method as RNS-RMHMC. 

Note that the approximation quality of the neural network surrogate function depends on several factors 
including the dimension of parameter space, d, the number of hidden neurons, s and the training size, N. 
Here, we assume that N is sufficiently large enough, and investigate the efficiency of RNS-HMC in terms 
of its acceptance probability for different values of d and s based on a standard logistic regression model 
with simulated data. Similar to the results presented in ||20l. Figureshows the acceptance rate (over 10 
MCMC runs) as a function of d and s. For dimensions up to d « 50, RNS-HMC maintains a relatively 
high acceptance probability with a shallow random network trained in a few seconds on a laptop computer. 
Appendix [B|provides a theoretical justification for our method. 
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Algorithm 3: Random Network Surrogate HMC 

Input: Starting position step size e and number of hidden units s 
Initialize the training data set: D = 0 or several random samples from the prior 
for f = 1, 2, • • • , B do 
Resample momentum p 

Simulate discretization of Hamiltonian dynamics and propose {q* ,p*) 
Metropolis-Hasting correction: 

u ~ Uniform[0,1], p = eyzp[H{q^^\p^^^) — H{q*,p*)\ 
if u < min(l, p), then = q*, D = D U {{q*,U{q*))}', 

Train a neural network with s hidden units via ELM on D to form the surrogate function a 
for f = _B + 1, _B + 2, • • ■ do 
Resample momentum p 

~ A/'(0,M), {qo,po) = (q^^\p^^'’) 

Simulate discretization of a new Hamiltonian dynamics using z: 
for / = 1 to L do 

pi-i t- pi-i - I 

qi ^ qi-i + eM~^pi-i 

L 

{q*,p*) = iqL,PL) 

Metropolis-Hasting correction: 

u ~ Uniform[0,1], p = exp[H{q^*\p^*^) — H{q*,p*)] 
if u < min(l,p), then = g*; 


d = 32 s = 2000 



Figure 4: Acceptance probability of surrogate induced Hamiltonian flow on simulated logistic regression 
models for different number of parameters, d, and hidden neurons, s. The step size is chosen to keep the 
acceptance probability around 70% for HMC. Left: Acceptance probability as a function of s (x-axis) and 
d (y-axis). Middle: Acceptance probability as a function of s for a fixed dimension d = 32. Right: 
Acceptance probability as a function of d for a fixed s = 2000. 


4 Adaptive RNS-HMC 


So far, we have assumed that the neural network model in our method is trained using a sufficiently large 
enough number of training points after waiting for an adequate number of iterations to allow the sampler ex- 
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Figure 5: Acceptance probability of the surrogate induced Hamiltonian flow based on a simulated logistic 
regression models with dimension d = 32. Left: Acceptance probability as a function of the number of 
hidden neurons s (x-axis) and the number of training points N (y-axis). Middle: Acceptance probability 
as a function of N for a fixed s = 1000 . Right: Acceptance probability as a function of s for a fixed 
N = 1600 . 


plore the parameter space. This, however, could be very time consuming in practice: waiting for a long time 
to collect a large number of training points could undermine the benefits of using the surrogate Hamiltonian 
function. 

Figure 1^ shows the average acceptance probabilities (over 10 MCMC chains) as a function of the number 
of training points, N, and the number hidden neurons, s, on a simulated logistic regression model for a 
fixed number of parameters, d = 32. While it takes around 2000 training data points to fulfill the net¬ 
work’s capability and reach a high acceptance comparable to HMC, only 500 training points are enough to 
provide an acceptable surrogate Hamiltonian flow (around 0.1 acceptance probability). Therefore, we can 
start training the neural network surrogate earlier and adapting it as more training points become available. 
Although adapting a Markov chain based on its history may undermine its ergodicity and consequently its 
convergence to the target distribution If3^ . we can enforce a vanishing adaption rate at such that ot —0 and 
a* = 00 and update the surrogate function with probability at at iteration t. By Theorem 5 of llJTl . 
the resulting algorithm is ergodic and converges to the right target distribution. 

It is straightforward to admt the neural network surrogate from the history of Markov chain. However, 
the estimator in Algorithm^ costs 0{Nds -f Ns"^ + s^) computation and 0{Ns) storage, where N is the 
number of training data (e.g., the number of rows in the output matrix H). As N increases, finding the right 
weight and bias for the output neuron becomes increasingly difficult. In ll38]l . Grevillle shows that W can be 
learned incrementally in real time as new data points become available. Based on Greville’s method, online 
and adaptive pseudoinverse solutions for updating ELM weights has been proposed in which can be 
readily employed here to develop an adaptive version of RNS-HMC. To be more efficient, only the estimator 
is updated. 

Proposition 1 Suppose the current output matrix is Hj. and the target vector is T^. At time k 1, a new 
sample qk+i and the target (potential) tk+i are collected. Denote the output vector from the hidden layer at 
qk+i as hk+i- The adaptive updating formula for the empirical potential matching estimator is given by 

‘^k+l ~ T (ffc-l-l 

where bk+i and auxiliary matrices 0fe+i are updated according to Ck+i = ^khk+i- 
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Algorithm 4: ARNS-HMC 

Input: Initial estimator vo and auxiliary matrices <l>o, 0o, adaption schedule at, 
step size e, number of hidden units s 
Initialize the surrogate zq = z{q, vq) 
for f = 0,1, • • ■ do 

Resample momentum p 

Propose {q* ,p*) by simulating discretization of Hamiltonian dynamics with ^ 
Metropolis-Hasting correction: 

u ~ Uniform[0,1], p = — H{q*,p*)] 

if u < min(l,p), then = g*;; 

else 

update the estimator and auxiliary matrices to vt+i, 3>r+i, 0t+r using (g*-*"*"^^, 

u ~ Uniform[0,1], 

if u < at, then zt+i = z{q, vt+i);; 

else 2 t+i = 2 t; 


(i) Cfc+i = 0 

bk-yi = 1 uT -’ ‘J’fc+i = 0fc+i = 0fc — ©fc/ifc+i&fe+i 

(ti) Cfc-|_i 0 

^ k “I” 1 '~F^ 

^fc+i = n-112’ ^k-yi = - ^khk-yibi,_^_i 

l|cfe+iir 

0fc+i = Bfc ~ 0fc^fc+i^fc+i + (1 + h"[y.iQkhk-yi)bk-yibk+i — ^fe+i^fc+iBfe 

At time k, the estimator takes a one-off 0{kds -f ks^ -f s^) computation and 0{s^) storage (only need to 
store and 0^, not i/^). Starting at a previously computed solution vk = and two auxiliary 

matrices = I — , Qk = , this adaptive updating costs 0{ds -f s^) computation 

and O(s^) storage, independent of the training data size k. Further details are provided in Appendix]^ We 
refer to this extended version of our method as Adaptive RNS-HMC (ARNS-HMC). 

5 Experiments 

In this section, we use several experiments based on logistic regression models and inverse problem for 
elliptic partial differential equation (PDF) to compare our proposed methods to standard HMC and RMHMC 
in terms of sampling efficiency dehned as time-normalized effective sample size (ESS). Given B MCMC 
samples for each parameter, ESS = B[1 -f 2S^^-^'y{k)\~^, where is the sum of K monotone 

sample autocorrelations {Ml. We use the minimum ESS over all parameters normalized by the CPU time, s 
(in seconds), as the overall measure of efficiency: min(ESS)/s. The corresponding step sizes e and number 
of leapfrog steps L for HMC and RMHMC are chosen to make them stable and efficient (e.g., reasonably 
high acceptance probability and fast mixing). The same settings are used for our methods. Note that while 
the acceptance rates are similar in the hrst two examples, they drop a little bit for the last two examples, 
which is mainly due to the constraints we imposed on our surrogate functions. To prevent non-ergodicity 
and ensure high ESS for both HMC and RNS-HMC, we follow the suggestion by @| to uniformly sample 
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HMC RNS-HMC 

Figure 6: HMC vs RNS-HMC; Comparing one- and two-dimensional posterior marginals of 
/3i, /3ii, /32 i, /^ai, Pii based on the logistic regression model with simulated data. 


L from {1,..., L} in each iteration. The number of hidden nodes s in random network surrogates are not 
tuned too much and better results could be obtained by more careful tunings. 

In what follows, we first compare different methods in terms of their time-normalized ESS after the burin- 
period. To this end, we collect 5000 samples after a reasonable large number of iterations (5000) of burn-in 
to make sure the chains have reached their stationary states. For our methods, the accepted proposals during 
the burn-in period after a short warming-up session (the first 1000 iterations) are used as a training set for a 
shallow random network. Later, we show the advantages of our adaptive algorithm. 

5.1 Logistic regression model 

As our first example, we compare HMC, RMHMC, RNS-HMC, and RNS-RMHMC using a simulation study 
based on a logistic regression model with 50 parameters and N = 10® observations. The design matrix is 
X = and true parameters /3 are uniformly sampled from [0,1]®°, where Xi ^ ^ 49 ( 0 , Y^/ 49 ). 

The binary responses Y = (yi, j/2, • • ■, Vn)'^ are sampled independently from Bernoulli distributions with 
probabilities pi = 1/(1 -f exp(—a;f /3)). We assume /3 ^ A/ 5 o( 0 , IOO/ 50 ), and sample from the correspond¬ 
ing posterior distribution. 

Notice that the potential energy function U is now a convex function, the Hessian matrix is positive semi- 
definite everywhere. Therefore, we use the Hessian matrix of the surrogate as a local metric in RNS- 
RMHMC. For HMC and RNS-HMC, we set the step size and leapfrog steps e = 0.045, L = 6 . For 
RMHMC and RNS-RMHMC, we set the step size and leapfrog steps e = 0.54, L = 2. 

To illustrate that our method indeed converges to the right target distribution. Figure [^provides the one- and 
two-dimensional posterior marginals of some selected parameters obtained by HMC and RNS-HMC. Table 
[T] compares the performance of the four algorithms. As we can see, RNS-HMC has substantially improved 
the sampling efficiency in terms of time-normalized min(ESS). 

Next, we apply our method to two real datasets: Bank Marketing and the a9a dataset ll45]l . The Bank 
Marketing dataset (40197 observations and 24 features) is collected based on direct marketing campaigns of 
a Portuguese banking institution aiming at predicting if a client will subscribe to a term deposit ll44ll . We set 
the step size and number of leapfrog steps e = 0.012, L = 45 for HMC and RNS-HMC; e = 0.4, L = 6 for 
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Experiment 

Method 

AP 

ESS 

s/Iter 

min(ESS)/s 

spdup 

LR (Simulation) 

HMC 

0.76 

(4351,5000,5000) 

0.061 

14.17 

1 

RMHMC 

0.80 

(1182,1496,1655) 

3.794 

0.06 

0.004 

s = 2000 

RNS-HMC 

0.76 

(4449,4999,5000) 

0.007 

123.56 

8.72 

RNS-RMHMC 

0.82 

(1116,1471,1662) 

0.103 

2.17 

0.15 

LR (Bank Marketing) 

HMC 

RMHMC 

0.70 

0.92 

(2005,2454,3368) 

(1769,2128,2428) 

0.061 

0.631 

6.52 

0.56 

1 

0.09 

s = 1000 

RNS-HMC 

RNS-RMHMC 

0.70 

0.90 

(1761,2358,3378) 

(1974,2254,2457) 

0.007 

0.027 

52.22 

14.41 

8.01 

2.21 

LR (a9a 60 dimension) 

HMC 

RMHMC 

0.72 

0.82 

(1996,2959,3564) 

(5000,5000,5000) 

0.033 

3.492 

11.96 

0.29 

1 

0.02 

s = 2500 

RNS-HMC 

0.68 

(1835,2650,3203) 

0.005 

81.80 

6.84 

RNS-RMHMC 

0.79 

(4957,5000,5000) 

0.370 

2.68 

0.22 

Elliptic PDE 

HMC 

0.91 

(4533,5000,5000) 

0.775 

1.17 

1 

RMHMC 

0.80 

(5000,5000,5000) 

4.388 

0.23 

0.20 

s = 1000 

RNS-HMC 

0.75 

(2306,3034,3516) 

0.066 

7.10 

6.07 

RNS-RMHMC 

0.66 

(2126,4052,5000) 

0.097 

4.38 

3.74 


Table 1: Comparing the algorithms using logistic regression models and an elliptic PDE inverse problem. For each 
method, we provide the acceptance probability (AP), the CPU time (s) for each iteration and the time-normalized ESS. 


RMHMC and RNS-RMHMC. The a9a dataset (32561 features and 123 features) is complied from the UCI 
adult dataset 1461 which has been used to determine whether a person makes over 50K a year. We reduce 
the number of features to 60 by random projection (increasing the dimension to 100 results in a substantial 
drop in the acceptance probability). We set the step size and number of leapfrog steps e = 0.012, L = 10 
for HMC and RNS-HMC; e = 0.5, L = 4 for RMHMC and RNS-RMHMC. All datasets are normalized to 
have zero mean and unit standard deviation. The priors are the same as before. The results for the two data 
sets are summarized in Table 1. As before, both RNS-HMC and RNS-RMHMC significantly outperform 
their counterpart algorithms. 


5.2 Elliptic PDE inverse problem 


Another computationally intensive model is the elliptic PDE inverse problem discussed in ll42l |4^ . This 
classical inverse problem involves inference of the diffusion coefficient in an elliptic PDE which is usually 
used to model isothermal steady flow in porous media. Let c be the unknown diffusion coefficient and u be 
the pressure held, the forward model is governed by the elliptic PDE 


{c{x,6)V xu{x,e)) = 0, 


(13) 


where x = (xi, a; 2 ) C [0,1]^ is the spatial coordinate. The boundary conditions are 


u{x,e)\x^=o = Xi, u{x,l 


2 = 1 = 1 - Xl, 


du{x,0) 


dxi 


= 0 , 


du{x,6) 


x\—Q 


^x^ 


= 0 


Xi — 1 


In our numerical simulation, ( [T3] ) is solved using standard continuous GEEM with bilinear basis functions 
on a uniform 30 x 30 quadrilateral mesh. 


A log-Gaussian process prior is used for c(a;) with mean zero and an isotropic squared-exponential covari¬ 
ance kernel: 


C{xi,X2) = cr^exp ( - 


\Xl - X2\ 

2P 
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for which we set the variance — 1 and the length scale I = 0.2. Now, the diffusivity field can be easily 
parameterized with a Karhunen-Loeve (K-L) expansion; 


c(a;,6»)«exp 



where Ai and Vi{x) are the eigenvalues and eigenfunctions of the integral operator defined by the kernel 
C, and the parameter 0i are endowed with independent standard normal priors, 6i ^ A/^(0, 0.5^), which 
are the targets of inference. In particular, we truncate the K-L expansion at d = 20 modes and condition 
the corresponding mode weights on data. Data are generated by adding independent Gaussian noise to 
observations of the solution field on a uniform 11 x 11 grid covering the unit square. 


yj=u(xj,0) + ej, Cj - A/'(0,0.1^), j = l,2, ...,iV 


The number of leap frog steps and step sizes are set to be L = 10, e = 0.16 for both HMC and NNS-HMC. 
Note that the potential energy function is no longer convex; therefore, we can not construct a local metric 
from the Hessian matrix directly. However, the diagonal elements 



as = 0.5, ay = 0.1, i = 1,2,... ,d 


are highly likely to be positive in that the deterministic part (first two terms) is always positive and the noise 
part (last term) tends to cancel out. The diagonals of the Hessian matrix of surrogate therefore induce an 
effective local metric which can be used in RNS-RMHMC. A comparison of the results of all algorithms are 
presented in TableAs before, RNS-HMC provides a substantial improvement in the sampling efficiency. 
For the RMHMC methods, we set L = 3, e = 0.8. As seen in the table, RMHMC is less efficient than HMC 
mainly due to the slow computation speed. However, RNS-RMHMC improves RMHMC substantially and 
outperforms HMC. Although the metric induced by the diagonals of the Hessian matrix of surrogate may 
not be as effective as Fisher information, it is much cheaper to compute and provide a good approximation. 

Remark. In addition to the usual computational bottleneck as in previous examples, e.g., large amount 
of data, there is another challenge on top of that for this example due to the complicated forward model. 
Instead of a simple explicit probabilistic model that prescribes the likelihood of data given the parameter 
of interest, a PDF ( [T3| l is involved in the probabilistic model. The evaluation of geometrical and statistical 
quantities, therefore, involves solving a PDF similar to ( [T3| l in each iteration of HMC and RHMHC. This 
is a preventive factor in practice. Using our methods based on neural network surrogates provide a huge 
advantage. Numerical experiments show a gain of efficiency by more than 20 times. More improvement is 
expected as the amount of data increases. 

5.3 Adaptive learning 

Next, using the above four examples we show that ARNS-HMC can start with far fewer training points 
and quickly reach the same level of performance as that of RNS-HMC. Figure shows that as the num¬ 
ber of training points (from initial MCMC iterations) increases, ARNS-HMC fully achieves the network’s 
capability and reaches a comparable acceptance rate to that of HMC. 

We also compare ARNS-HMC to HMC and RNS-HMC in terms of the relative error of mean (RFM) which 
is defined as \\q{t) — E{q)\\ 2 /\\E{q)\\ 2 , where q{t) means sample mean up to time t. Figureshows the 
results using the four examples discussed above. Note that before training the neural network models, both 
RNS-HMC and ARNS-HMC are simply standard HMC so the three algorithms have similar performance. 
As we can see, ARNS-HMC has the best overall performance: it tends to provide lower RFM at early 
iterations. This could be useful if we have limited time budget to fit a model. 
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Eigure 7: Median acceptance rate of ANNS-HMC along with the corresponding 90% interval (shaded area). 
The red line shows the average acceptance rate of standard HMC. 


6 Discussion and Future Work 

In this paper, we have proposed an efficient and scalable computational method for Bayesian inference 
by exploring and exploiting regularity of probability models in parameter space. Our method is based on 
training surrogate function of the potential energy after exploring the parameter space sufficiently well. For 
situations where it is not practical to wait for a thorough exploration of parameter space, we have proposed 
an adaptive version of our method that can start with fewer training points and can quickly reach its full 
potential. 

As an example, we used random networks and efficient learning algorithms to construct effective surrogate 
functions. These random bases surrogate functions provide good approximations of collective information 
of the full data set while striking a good balance between accuracy and computation cost for efficient com¬ 
putation. Random networks combined with the optimized learning process can provide flexibility, accuracy, 
and scalability. Note that in general the overall performance could be sensitive to the architecture of the 
random network. Our proposed random network surrogate method scales differently than GP emulators 
because of the specihc constraints we imposed on its architecture. As our experimental results show, this 
approach could improve the performance of HMC in some applications. 
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Eigure 8; Relative error of mean as a function of running time. 


In its current form, our method is more effective in problems with costly likelihood and a moderate number 
of parameters. In spite of improvements we have made to standard HMC, dealing with high dimensional 
and complex distributions still remains quite challenging. Eor multimodal distributions, for example, our 
method’s effectiveness largely depends on the quality of training samples. If these samples are collected 
from one mode only, the surrogate function will miss the remaining modes and the sampler might not be able 
to explore them (especially if they are isolated modes). A surrogate function based on Gaussian processes 
might have a better chance at finding these modes in the tails of the approximate distribution since it tends 
to go to zero gradually. To address this issue, we can utilize mode searching and mode exploring ideas such 
as those proposed by B8ll49ll . Eor constrained target distributions, we can employ the method of ll50l based 
on Spherical HMC. 

For HMC, gradient of the potential function is an important driving force in the Hamiltonian dynamics. 
Although accurate approximation of a well sampled smooth function automatically leads to accurate ap¬ 
proximation of its gradient, this is not the case when the sampling is not well distributed. For example, when 
dense and well sampled training data sets are difficult to obtain in very high dimensions, one can incorporate 
the gradient information in the training process. In future, we will study more effective way to utilize this 
information in the training process. As a common practice in adaptive MCMC methods, one may also learn 
the mass matrix M adaptively together with the surrogate in ARNS-HMC. 
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A Convergence to the correct target distribution 

In order to prove that the equilibrium distribution remains the same, it suffices to show that the detailed balance condition 
still holds. Denote 6 = (p, q), 9' = (j)t{d). In the Metropolis-Hasting step, we use the original Hamiltonian to compute 
the acceptance probability 

a(9,9') = min(l, exp[—H{9') + H(9)]) 

therefore, 

a(9, 9')¥{d9) = a{9,9') exp[-H{9)]d9 

* min(exp[—77(0)],exp[—H’(6'')]) 

= a{9' ,9)iixp\-H{9')]d9' 

= a(9',9)¥{d9') 

since | = 1 due to the volume conservation property of the surrogate induced Hamiltonian flow rpt- Now that we 

showed the detailed balance condition is satisfied, along with the reversibility of the surrogate induced Hamiltonian flow, 
the modified Markov chain will converge to the correct target distribution. 



B Potential Matching 

In the paper, training data collected from the history of Markov chain are used without a detailed explanation. Here, we 
will analyze the asymptotical behavior of surrogate induced distribution and try to explain why the history of the Markov 
chain is a reasonable choice for training data. Recall that we find our neural network surrogate function by minimizing 
the mean square error dD- Similarly to 1351 . it turns out that minimizing o is asymptotically equivalent to minimizing 
a new distance between the surrogate induced distribution and the underlying target distribution, independent of their 
corresponding normalizing constants. 

Suppose we know the density of the underlying intractable target distribution up to a constant 

P{q\Y) ^ ^exp{-U{q)) 

where Z is the unknown normalizing constant. Our goal is to approximate this distribution using a parametrized density 
model, also known up to a constant, 

Q(<?> exp {-V{q, 9)) 

Ignoring the multiplicative constant, the corresponding potential energy functions are U{q) and V{q,9) respectively. 
The straightforward square distance between the two potentials will not be a well-defined measure between the two 
distributions distributions because of the unknown normalizing constants. Therefore, we use the following distance 
instead: 


K{9) = mm J \\V{q,9) - U{q) - df P{q\Y) dq 

= J \\Viq,9)-U{q)f Piq\Y)dq-[E,{V{9)-U)f =Ys.r,iV{9)-U) (14) 

Because of its similarity to score matching (35), we refer to the approximation method based on this new distance as 
potential matching', the corresponding potential matching estimator of 9 is given by 

9 = arg m^in K{9) 

It is easy to verify that K{9) — 0 ^ V{9) = U + constant => Q{q, 9) = P{q\Y), so K{9) is a well-defined squared 
distance. Exact evaluation of l|14[l is analytically intractable. In practice, given N samples from the target distribution 
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gi, § 2 , • • •, gjv, we minimize the empirical version of 01 

1 ^ 

K{e) = min - ^ \\V{q^, 9) - U{q..) - , 


iv / N 

n=l \ n=l 


(15) 


which is asymptotically equivalent to K by law of large numbers. 01 could be more concise if we allow a shift term in 
the parametrized model {V{q, 9) = V{q, 9-d) + 9d)- In that case, the empirical potential matching estimator is 

1 ^ 

9 = argmm.^(6») = argmmmm — ^ \\V{q„,9-d) + {9d - d) - U{qn)\\'^ 

n = l 

1 ^ 

= argmm — ^ \\V{qn,9-d) + 9d - U{q„)f 

n=l 
1 ^ 

= argmm — ^ \\V{qn,9) - U{q„)f 


In particular, if we adopt an additive model for V{q, 9) with a shift term 

S 

V{q,9) = '^Via{wi ■ q + di) + b, 9 = {v,b) 

i = l 

where Wi , di and activation function o are chosen according to Algorithm and collect early evaluations from the 
history of Markov chain 

Tn = (7(g(2))),..., {q^-^\U{q^^^))} 

as training data; this way, the estimated parameters (i.e., weights for the output neuron) are asymptotically the potential 
matching estimates 

lim = arg min lim C{9\Tn) = 9 

N^oo " e iV->-oo 

since the Markov chain will eventually converge to the target distribution. When tmncated at a finite N, the estimated 
parameters are almost the empirical potential matching estimates except that the samples from the history of the Markov 
chain are not exactly (but quite close) from the target distribution. 


C Adaptive learning 


Theorem 1 (Greville) If a matrix A, with k columns, is denoted by Ak and partitioned as Ak = [A^-i a^], with Ak-i 
a matrix having k — 1 columns, then the Moore-Penrose generalized inverse of Ak is 


where 


At _ CLkbk) 

— iT 

afe 


Ck = {I - Ak-iA\_f)ak, 



l + l|At_^afc|P ’ 


Cfe = 0 
Ck fQ 


Proof of Propositonj^ 

To save computational cost, we only update the estimator. Suppose the current output matrix is Hk+i 



and 


24 








the target vector is Tk+i = 


n 

tk+l 


, then 


= Tf+i ^ vl+i = Tf+i(//J+i)t 

= [tJ tk+i] [[Hi hk+i]) 

= [Tk tk+l] 


{Hiy{i-hk+ibi+,) 


^k+1 


= n 


[^k') (^ “ ^t=+i^A+i) + ifc+i^fc+i 


— Vk + (tk+i — Vk hk+i)bk+-i 

where &fe+i is obtained according to Theoremj^ Note that the computation of 6fc+i and Ck+i still involves Hk and hI 
whose sizes increase as data size grows. Following 140114111^ . we introduce two auxiliary matrices here 


= 7 - HliHiy G e, = [{HiyYiHiy = HliHl )t G 

and rewrite bk+i and Ck+i as 

- O l, 

Cfc+i = 0 


Cfc+i = ^khk+i, bk+i = < i+^fc+i®fc^fc+i 


Qfchfc 


l|C)j + l II 

Updating of the two auxiliary matrices can also be done adaptively 

TT\t 


Ck+i y 0 


$k+i = 7 - Hl+yHl+,y = 7 - [77,^' hk+i] 


iH^Yii-hk+ibi+y 

hi 


^k+\ 


= ^k — ^k.h 


kiT'k+lOk+1 


Qk+i = 77t^,(77j+i)^ = [{I-bk+ihl+yHl bk+i] 
and if Cfc+i = 0, these formulas can be simplified as below 


{HiYii-hk+ibi+y 

bl+i 


— {I—bk+ihk+i)Qk{I—hk+ibl^i)+bk+ibl^i 


4’fc+l — ^k, ©fc+l — 0fe — Qkhk+lbk+1 
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